instructional improvement. Psychometric and multilevel educational
differences in test performance can be used to detect differences in
content, sequence and quality of instruction. Patterns of data
responses from a standardized testing program conducted within a
school district in the Beginning Teacher Evaluation Study (BTES) were
examined. Between-group indices of the distribution of performance by
class, school and ethnic background were generated under alternate
rules for content classification. The subsets of group sensitive
items and between-class correlation of items with instructional
variables were used to build a scale sensitive to the variable used
to select items. Item response patterns were related to group
membership and instructional coverage. The results supported the
argument that tests can be constructed in multilevel approaches
sensitive to desired group characteristics for information about
instructional experience differences. (CM)

Empirical Studies of Multilevel Approaches to
Test Development and Interpretation

Review and Rationale

During the past several years, CSE personnel have been working on
the applicability of multilevel methods to test development and
interpretation. An initial report (Miller & Burstein, 1979) detailing
conceptual models for applying multilevel analysis principles to test
development and interpretation was submitted in November 1979. However,
it was clear that we had only begun to scratch the surface of this problem.
Moreover, the problem appeared sufficiently important in a number
of educational contexts to warrant further attention.

Instructional Sensitivity of Tests. The impetus for the work on
multilevel approaches to test development and interpretation is the
increasing concern about the instructional sensitivity of standardized
achievement tests. This concern derives from several aspects of current
thinking about such testing. First, there is support for the notion
that test performance is high when there is substantial overlap between
the content of the test and the content of instruction (e.g., Armbruster
et al., 1977; Jenkins & Pany, 1976; Leinhardt & Seewald, 1980; Madaus
et al., 1979; Walker & Schaffarzick, 1974). Given this connection, the
evidence of wide variation in content coverage in the major standardized
achievement tests (Porter et al., 1978) raises the question of whether
schools have carefully selected the test which best fits their
curriculum (and whether this is even possible in a district with many
schools). Second, researchers from diverse viewpoints have argued that
while the broad spectrum of standardized achievement tests may be useful
indicators for illuminating state and national policies, these tests are
insensitive to instructional or program effects (Airasian & Madaus, 1976,
1980; Berliner, 1978; Carver, 1974, 1975; Hanson & Schutz, 1978; Madaus
et al., 1979, 1980; Porter et al., 1978).

The weak evidence of schooling and program effects (Averch et al.,
1972; Coleman et al., 1966; Stebbins et al., 1977) in the face of
strong beliefs that students do learn from given school and program
experiences is largely responsible for current challenges to the
instructional and program relevance of standardized achievement tests.
The challenges from researchers knowledgeable about classroom practices
and processes are based on the argument that as long as teachers have
the freedom to choose areas of coverage and emphasis, tests cannot be
expected to have relevance for all classrooms. Curriculum developers
offer similar reasons for suggesting that tests are not appropriate to
the content of their curricula. While these arguments have intrinsic
merit, they raise as many questions about the appropriateness of
instructional coverage decisions by teachers and curriculum developers
as they do about the utility of the tests for measuring skills that
should be part of the repertoire of the nation's students.

These concerns about the instructional sensitivity and program
relevance of norm-referenced achievement tests have caused some educational
researchers and practitioners to turn to criterion-referenced measurement
(e.g., see Berk, 1980; Baker, Linn, & Quellmalz, 1980; Harris, Alkin,
& Popham, 1974; Popham, 1978). When looking at a single program with
common goals, objectives, and curriculum coverage, criterion-referenced
tests can provide a better measure of the quality of instruction when
targeted to the specific goals and objectives of the program. However,
once a study shifts from a single uniform program to examine multiple
groups (e.g., classrooms or schools) that may share a common general goal
but approach it differently (e.g., different specific instructional
objectives, different sequencing, or different relative emphasis across
objectives), trouble arises in trying to develop criterion-referenced
tests that are both specific to the program of each group (classroom or
school) and yet general enough for comparisons across groups. One
alternative is to build criterion-referenced measures that contain all
the objectives of all the programs. But this strategy can rapidly become
unwieldy because the differences between programs generate too much
material to test. Furthermore, when some programs cover more objectives
than others, they are still at an advantage because there are fewer novel
topics covered on the exam.

Given the problems with using criterion-referenced tests to measure
differences between groups which differ in instructional objectives
and/or approaches, it is not surprising that norm-referenced tests
continue to be used for cross-program (school or classroom) comparisons,
especially when they are judged to adequately cover (at least at some level
of generality) the common part of the curriculum. The challenge is to
ensure that whatever measures are used to judge impact are sufficiently
sensitive to differences in programs and instructional groups. Since
standardized tests are at present the primary evidence for such judgments,
the extent to which they perform their desired function warrants attention.

Measuring Programs As Well As Students. There is a perhaps too
subtle shift in emphasis implicit in our concerns about the instructional
and program relevance of measures of student performance. The rationale
for the current investigation might instead be viewed as part of a shift
in the conception of the purpose for standardized achievement testing
in education. A traditional conception would clearly emphasize obtaining
a description (measure) of what students know and how their knowledge
compares with that of a relevant group (classmates, same school, same
grade level, publishers' norms, etc.). The same rationale holds whether
one is talking about norm-referenced or criterion-referenced measurements,
though with the latter, both the degree of specificity of the pertinent
body of knowledge and the nature of the comparison (to a given level of
performance within the domain of knowledge reflected in the test) are
changed. Measuring what students know is still the primary concern.

This individualistic conception of achievement measurement served
well as long as the measures of performance were intended only to help
reach decisions about individuals (e.g., Does the student have the
necessary background knowledge for Algebra II? Who should be selected for
an academic scholarship? Which students need remedial instruction in
reading? Should the student be advanced to the next objective or spend
additional time on the ones already studied?). While the level of
generality required in dividing performance measures into content domains
might vary depending on the specific circumstances (see Baker, 1981), that
the decisions are being made about individuals is still the dominant
feature of this kind of achievement measurement, not whether the tests
are norm or criterion referenced.

At a simpler period in our history when American citizens were less
mobile and more homogeneous, school "systems" were smaller, fewer students
advanced to each higher level of the educational system, and there was
less to be learned and a greater consensus (folklore) on instructional
content and method, operating by a strictly individualistic conception of
achievement measurement may have been the proper role for testing in
schools. However, the growth in the diversity of modern American society,
with the accompanying expansion of the educational level of the citizenry,
the information and knowledge to be learned, the centralization of schools
into larger school systems and the broadening of the array of curriculum
and instructional alternatives, raises questions about the adequacy of
purely individualistic models of achievement testing for meeting the
changing organization, operations and needs of American education.

Under present conditions in education, then, it seems particularly
appropriate to delineate an additional conception of the purpose of
achievement testing. This conception emphasizes the role of performance
on achievement tests as measures of the quality of the student's
educational experiences. Under this conception, the focus shifts from
obtaining a status assessment of the individual student to an examination
of whether students coming from given educational programs have obtained
certain levels of knowledge. The focus is no longer strictly on the
student; the school system, through its choice of programs in which to
participate, through the curriculum decisions about what to teach, through
the specific instructional activities of individual teachers and through
the coordination of these activities among teachers (both at the same and
at different grade levels or subject matters) in the same school and
district, is viewed as having a direct responsibility to accomplish its
educational goals for its students and is held accountable by the public
for its actions. Decisions about programs (e.g., How does the performance
of students in the pull-out program compare to performance in mainstreamed
instruction with more educational assistance in the classroom? Is the
special tutorial program enhancing student learning?) and instruction
(e.g., Are students in school (classroom) A showing sufficient educational
progress? Are students in classroom A which uses textbook Q learning
the same things (and as well) as students in other classes using textbook
W? Does the body of knowledge taught students in grade M in school B
prepare them adequately for the instruction planned in grade M+1?
Which instructional topics need further study to bring students in class
(school) P up to an acceptable performance level?) are emphasized in
addition to concerns about individual learners.

This conception of testing as a means to examine the results of
educational programs is in line with the concerns of researchers and
policy-makers interested in measuring program and schooling effects.
More importantly, we argue that this view of achievement testing is
consonant with current emphasis on linking testing and instruction in
schools and on systemic efforts at program and instructional improvement.
It is also clear that this conception places greater emphasis on the
aggregation of test scores across students within classrooms, schools,
programs, districts, etc., in order to provide information in a form that
is more directly relevant to program and instructional decision-making
than strictly student level data would be.

Psychometric Considerations. Given a concern for measuring program
and instructional differences as well as individual differences, the
complaints about the traditional psychometric basis for standardized
test construction are well taken. While these tests have been used to
assess the achievement or ability differences among individuals, as
well as to rank the achievement differences among aggregates of individuals
(e.g., classes or schools), the psychometric model used in test
construction has focused primarily upon the former. Some critics have
argued that tests designed to differentiate among individuals maximize the
within-school differences relative to the between-school or
between-program differences (Airasian & Madaus, 1980; Carver, 1974, 1975;
Lewy, 1973; Madaus et al., 1980).

Theoretically, of course, there is no reason to assume that a test
designed to measure individual differences cannot also measure school
or program differences. However, the bulk of the evidence from school
effectiveness studies seems to suggest that either school or program
differences do not exist or we are measuring the differences improperly
(Madaus et al., 1980).

Multilevel Considerations. The concerns cited above seem to reflect
the same units of treatment and analysis issues which underlie much
of the recent work on analysis of multilevel educational data (Barr
& Dreeben, 1977, 1981; Burstein, 1980a, 1980b; Cooley, Bond, and Mao,
1981; Cronbach, 1976; Wittrock & Wiley, 1970). Cronbach (1976) directly
addressed the units of analysis implications for test construction and
interpretation, and a few studies (e.g., Airasian & Madaus, 1976; Lewy,
1973; Madaus, Rakow, Kellaghan, & King, 1980; Rakow, Airasian & Madaus,
1978) have sought to use test data from multiple levels to reflect
schooling and program effects. These efforts barely hint at the
possibilities, however.

We argue that multilevel examinations of test item data have the
potential to lead to better informed test development, analysis,
interpretation, and reporting procedures. For example, careful
investigations of test item data might enable one to identify effects due
to background differences (e.g., prior learning, sex, socioeconomic and
demographic differences), instructional coverage and emphasis, and
instructional organization (e.g., grouping and pacing effects). If these
separate effects can be identified, it would then be possible for school
personnel to reconstruct from item data a variety of composites which are
potentially sensitive to the context factors of their choosing. Likewise,
test developers could include in their test development activities
procedures which would guard against unknowingly selecting items influenced
by "irrelevant" context and situational characteristics (where "irrelevancy"
is determined by the purposes for which the test would be used). At the
least, developers would be better able to describe the properties of
their tests after carrying out a multilevel examination of those properties.

Our activities under the present grant period were directed to
identifying analytical methods which can distinguish the effects of
various factors that affect between-group (class, school) and
within-group test performance. It was expected that such a multilevel
examination would facilitate the use of test data in program and
instructional decision-making at various levels of the educational system.
Hopefully, the analytical strategies are equally applicable to tests
developed for either norm-referenced or criterion-referenced usage.

Methods

The actual empirical investigation undertaken focused on two general
approaches for measuring between-group (classroom, school, program, etc.)
differences in test performance. Both approaches consider the empirical
characteristics of between-group performance on test items or subsets
of test items.

Investigations at a level below the total test are considered essential
to detect differences in the content, sequencing, and quality of instruction.
Since one is seldom interested in the consequences of no math instruction
(versus some), but is often interested in the choice between time
spent on and methods used in developing, say, computational skills, one
is likely to miss relevant differences in the effects of instruction by
considering only total test scores.

Desirable vs. Available Study Characteristics. The practical
scenario that guided our empirical inquiry was an examination of the
data from a standardized testing program conducted within a school
district. Ideally, at any given grade level, these data would be available
at the item level for students within a number of classrooms within the
district's schools. Under these circumstances, the student responses
to individual test items can be both vertically aggregated (by instructional
groups within classrooms, classrooms within schools, and schools within the
district, as well as by demographic groups, e.g., males vs. females,
monolingual vs. bilingual students, different demographic groups) and
horizontally aggregated (across items within a narrow domain, to the level
of instructional units, at the typical subtest level on achievement tests,
as well as by specific combinations of subtests and other classifications
of items, e.g., according to process being tested, linguistic features,
task structure, etc.) to obtain the desired specificity of information
about program and instructional differences. Thus, an investigator
would be able to generate indices of the distribution of test performance
for a variety of groupings of students (by class, school, ethnic group,
etc.) under alternative rules for content classification.
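
The two directions of aggregation can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the record layout (`class_id`, `item`, `correct`), the item-to-domain mapping, and the function name `aggregate` are hypothetical and are not taken from the BTES files.

```python
from collections import defaultdict

def aggregate(records, group_key, item_domain):
    """Proportion correct for each (group, content domain) cell.

    records     : list of dicts, one row per student-item response
    group_key   : field to aggregate over vertically (e.g. "class_id")
    item_domain : maps each item to a content domain (the horizontal rule)
    """
    cells = defaultdict(lambda: [0, 0])  # [number correct, number attempted]
    for rec in records:
        cell = cells[(rec[group_key], item_domain[rec["item"]])]
        cell[0] += rec["correct"]
        cell[1] += 1
    return {key: right / tried for key, (right, tried) in cells.items()}

# Hypothetical data: two classes, three fraction items in two domains.
records = [
    {"class_id": "A", "item": "f1", "correct": 1},
    {"class_id": "A", "item": "f2", "correct": 1},
    {"class_id": "A", "item": "f3", "correct": 0},
    {"class_id": "B", "item": "f1", "correct": 0},
    {"class_id": "B", "item": "f2", "correct": 1},
    {"class_id": "B", "item": "f3", "correct": 1},
]
item_domain = {"f1": "addition", "f2": "addition", "f3": "algebraic"}
by_class = aggregate(records, "class_id", item_domain)
```

Changing `group_key` (for example to a school or demographic field) changes the vertical grouping, while swapping in a different `item_domain` mapping applies an alternative rule for content classification.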

The empirical work was conducted on data from the Beginning Teacher
Evaluation Study (BTES; Fisher, Filby, Marliave, Cahen, Dishaw, Moore, &
Berliner, 1978). The primary data set contains test performance of
125 fifth-graders (approximately 6 students from each of 22 classrooms)
on the fifteen fraction items from the BTES test battery. The fractions
subtest was administered on three occasions: prior to any significant
amount of fractions instruction (occasion B, December), near the end of
the school year (occasion C, May), and again the following October (occasion
D). Fractions was chosen because of its predominance in fifth grade
mathematics instruction.

The six students in each classroom selected for intensive study
scored between the 30th and 60th percentile on a beginning-of-the-year
prediction battery given to all the students from the 22
classrooms. The limitation on the number of students studied was due
to the intensive classroom observations (approximately 25 full days during
the year) and teacher record keeping requirements. (Teachers were
required to keep daily records of the specific time allocated to different
content areas for each student in the intensive study.) The students
were chosen from the narrower range to ensure that the study concentrated
on the learning experiences of "typical fifth graders". In addition to
the test information described above, our investigation also included
the BTES measures of Allocated Time in fractions between the B and C
test occasions, student Engagement Rates during mathematics instruction,
and the proportions of student time during math spent on tasks with which
they achieved high success (missed very few problems) and low success
(answered very few problems correctly). Additional details about the
data set are contained in the longer report in the Appendix.

In practice, the BTES data differed in several respects from the
data described under the ideal scenario. Typical classrooms have more
students and most likely a broader range of abilities. Moreover, the
content investigated is much narrower than would be typically available
in a standardized test battery, though there were perhaps more items
devoted to fractions than one would typically find. Moreover, the full
sample was more homogeneous than the fifth-grade population as a
whole. It might also be the case that the mathematics performance levels
of the classrooms were more homogeneous than the typical distribution of
fifth-grade classrooms.

These departures from the ideal both helped and hurt our empirical
efforts. The overall sample size was sufficiently small to allow
thorough empirical analysis by both statistical and graphical means at
reasonable cost. We were better able to trace particularly interesting
results back to their source than one could with larger data sets. On
the other hand, the small sample restricted the power of the statistical
tests one might perform (we were more interested in the magnitude of
particular indices than in their statistical significance) and
caused certain empirical indices to be overly sensitive to the atypical
performance of individual students within classrooms.

Similarly, the restriction in test content had mixed consequences.
On the one hand, we were gratified to find that potentially important
differences in instructional activities could be identified by examining
class-level performance on items and relatively homogeneous subsets of
items. There would seem to be clear advantages in being able to pinpoint
instructional effects at a level of specificity suitable for instructional
remediation. On the other hand, because a broader array of content was
never investigated, there is no way to determine whether the methods used
are sensitive to instructional and program differences at a higher level
of generality. Research by Madaus, Airasian, and their associates and
by Harnisch and Linn (1981) does suggest, however, that the methods
studied are applicable to data covering a broader range of content.

We will not comment further on the limitations of our empirical
work. Clearly, more empirical efforts are needed to determine just how
useful multilevel methods can be in test development and interpretation
in local school settings.

Specific Analytical Procedures. As stated earlier, our empirical
investigation of between-group program and instructional differences
emphasized two distinct approaches. In the first approach, the empirical
properties of five indices of item discrimination between groups were
investigated. The merits of each index as a criterion for selecting
items during test construction were explored. Scales were constructed
by choosing items that exceeded a certain level on a specific index of
between-group item discrimination. The empirical properties of the
constructed scales were then examined and compared with the characteristics
of the 15-item fractions total score. The five indices investigated
were as follows:

(a) the item intraclass correlation (the proportion of variation in
item scores associated with between-class sources of variation);

(b) the combination of item intraclass correlations used in
conjunction with between-class item intercorrelations (i.e.,
the correlations of class mean performance on one item with
class mean performance on other items);

(c) the between-class correlation of item performance with total
test performance (the group-level analogue of the point-biserial
correlation);

(d) a discriminant analysis in which items are used to discriminate
among classrooms; and,

(e) the between-group correlation of item performance with a measure
of instruction (in this case, time allocated to fractions
instruction).
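
Indices (a), (c), and (e) reduce to simple computations on item scores and class labels. The following Python sketch is illustrative only. It takes index (a) literally as the between-class share of the total sum of squares, which is an assumption (other intraclass estimators exist), and all function names and data values are hypothetical.

```python
from statistics import mean

def class_means(scores, classes):
    """Mean score per class; `scores` and `classes` are parallel lists."""
    groups = {}
    for score, label in zip(scores, classes):
        groups.setdefault(label, []).append(score)
    return {label: mean(values) for label, values in groups.items()}

def between_class_proportion(scores, classes):
    """Index (a): share of total sum of squares lying between class means."""
    grand = mean(scores)
    means = class_means(scores, classes)
    ss_total = sum((s - grand) ** 2 for s in scores)
    ss_between = sum((means[c] - grand) ** 2 for c in classes)
    return ss_between / ss_total if ss_total else 0.0

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def between_class_correlation(item, other, classes):
    """Indices (c) and (e): correlate class means on an item with class
    means on another variable (total score, allocated time, ...)."""
    m_item = class_means(item, classes)
    m_other = class_means(other, classes)
    labels = sorted(m_item)
    return pearson([m_item[k] for k in labels], [m_other[k] for k in labels])

# Hypothetical data: six students in three classes.
classes = ["A", "A", "B", "B", "C", "C"]
item = [1, 1, 0, 0, 1, 0]        # 0/1 scores on one item
total = [12, 10, 4, 6, 9, 7]     # total test scores
icc_a = between_class_proportion(item, classes)
r_c = between_class_correlation(item, total, classes)
```

Passing a per-student copy of a class-level instructional measure as `other` gives index (e) with the same function.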

The criteria used to judge the merits of specific indices included
the intraclass correlation of the constructed scale, the magnitude of the
effects of instructional variables in regression analyses with student
performance on the constructed scale as the dependent variable and
between-class and within-class instructional and background measures as
explanatory variables, and the overall proportion of "variation explained"
(R^2) in student performance. The belief was that specific indices would
lead to the construction of scales that retained between-group variation in
test performance, increased the relationship of instructional variables
to performance and required fewer test items.

The second group of analytical strategies involved adapting procedures
previously employed for examining patterns of test item responses of
individual students to detect differences between groups (classes in this
study) of students. Patterns of correct item responses were investigated
through the generation of class-level variants of the Student-Problem
Chart developed by Sato (1980). The properties of the mean and standard
deviation of Sato's caution index (a measure of the anomalousness of an
individual's pattern of correct item response) as a possible statistical
measure of differential instructional coverage and emphasis across
classrooms were also explored. Finally, the use of the patterns of incorrect
item responses as information about between-class instructional differences
was examined.
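
For readers unfamiliar with the caution index, the sketch below shows one common formulation: one minus the ratio of the covariance between a student's responses and the item totals to the same covariance for a perfect (Guttman) pattern with the same number correct. It is offered as an assumption about the computation, not as the exact procedure used in this study, and the function name and data are hypothetical.

```python
def caution_indices(responses):
    """Caution index per student from a 0/1 student-by-item matrix.

    responses: list of equal-length 0/1 lists, one per student.
    Returns None for a student when the index is undefined (no items
    correct, all items correct, or all items equally difficult).
    """
    n_items = len(responses[0])
    item_totals = [sum(row[j] for row in responses) for j in range(n_items)]
    mean_total = sum(item_totals) / n_items
    easiest_first = sorted(item_totals, reverse=True)
    result = []
    for row in responses:
        k = sum(row)  # number of items this student answered correctly
        observed = sum(u * n for u, n in zip(row, item_totals)) - k * mean_total
        perfect = sum(easiest_first[:k]) - k * mean_total
        if k == 0 or k == n_items or abs(perfect) < 1e-12:
            result.append(None)
        else:
            result.append(1 - observed / perfect)
    return result

# Hypothetical data: the last student misses the easy items but
# answers the hardest item correctly.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
caution = caution_indices(responses)
```

A value near zero indicates a response pattern consistent with the item difficulties in the group; larger values indicate more anomalous patterns.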




Results

Subsets of Group Sensitive Items. The investigation of the five
alternative indices for selecting items for constructing scales more
sensitive to group differences pointed to a number of similarities and
differences among the indices. First, the indices tended to select
slightly different subsets of items. Moreover, the items selected by
most indices did not represent any clear content clusters, but rather
specific empirical nuances that aligned the analytical foundation for
a specific index with the characteristics of student performance. Thus,
investigators are likely to need to use several indices to avoid basing
item selection on special circumstances existing in a given sample of
classrooms and schools.

Second, the scales constructed by all five indices exhibited
approximately the same proportion of between-class variation (ranging from
.42 to .50) as the total scale (.47). This level of retention of variation
was obtained despite one-third (10 item) and two-third (5 item) reductions
in test length. Obviously, focusing on indices of between-group
discrimination accentuates the between-class differences in item
performance that were the basis for their consideration in the first place.
Unfortunately, the relationships of the scales to the instructional and
background variables fluctuated according to the index used for item
selection. As might be expected, the index based on the between-class
correlation of the items with instructional variables was most effective in
building a scale sensitive to the variable used to select items. Other
differences were less predictable. The obvious conclusion from the analysis
was that if investigators know the variable according to which they wish
to distinguish performance, then selecting items on the basis of their
relation to that variable is an effective strategy for empirical item
selection.
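
The selection strategy described here amounts to thresholding an index. A minimal Python sketch, assuming class-level item means and a class-level instructional measure are already available; the cutoff of 0.9, the item labels, and the minutes of allocated time are invented for illustration and are not BTES values.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def select_items(item_class_means, instruction, threshold):
    """Keep items whose class means correlate at least `threshold`
    with the class-level instructional variable."""
    return [
        item
        for item, means in item_class_means.items()
        if pearson(means, instruction) >= threshold
    ]

# Hypothetical class means for three items across three classes.
item_class_means = {
    "i1": [0.2, 0.5, 0.9],
    "i2": [0.8, 0.7, 0.75],
    "i3": [0.1, 0.4, 0.6],
}
allocated_minutes = [100, 200, 300]  # hypothetical time per class
selected = select_items(item_class_means, allocated_minutes, threshold=0.9)
```

As the stability results reported in this section suggest, a subset chosen this way in one sample should be checked in a second sample before it is relied upon.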

Finally, the stability of the indices was investigated by comparing
scales formed using the data already described with the scales formed
from a limited set of pilot data (5 full classes containing approximately
120 students). None of the indices of item discrimination between groups
were particularly stable across samples. Different items were selected,
the intraclass correlations for the constructed scales changed and the
relation of the scale to instructional variables fluctuated. However,
the limited number of groups in the pilot study might be at least
partially responsible for the observed instability.

Patterns of Item Response. The examination of between-class patterns
of correct and incorrect item responses indicated that the patterns of
responses were related to group membership. Moreover, since results held up
after controlling for between-class differences on the pretest, the pattern
of responses appears to be related to instructional coverage and emphasis.

The patterns of correct item response on the posttest clearly showed
a relationship to instructional coverage that was not visible prior to
instruction. For example, certain classes with only poor or average
performance in the addition of fractions exhibited high performance
on the more difficult "algebraic manipulation" topic. The differences
in coverage and emphasis turned out to be most evident at the item level.
For example, students in some classrooms managed to learn simple addition
and subtraction of fractions with common denominators and virtually
nothing else.

The results from the use of the class mean and standard deviation
on the caution index as statistical indices to detect unusual instructional
patterns were mixed. Classrooms whose unusual instructional coverage and
emphasis were evident from the patterns of correct responses tended to
have high mean caution indices. Unfortunately, there were several classes
in which the anomalous response pattern for a single student (out of 6)
also resulted in high mean caution indices. However, since these
classrooms also tended to exhibit high variability in the caution index, it
was still possible to separate classrooms with distinctive instructional
patterns from those with variable student response patterns. The confusion
of individual with group anomalousness should be even less likely in
regular size classes.
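
The separation described above (high mean with low spread versus high mean with high spread) can be written as a simple rule. The Python sketch below is hypothetical throughout: the cutoffs `mean_cut` and `sd_cut`, the labels, and the caution values are illustrative choices, not thresholds or results from this study.

```python
from statistics import mean, pstdev

def summarize_caution(caution, classes):
    """Class mean and (population) standard deviation of caution indices."""
    groups = {}
    for value, label in zip(caution, classes):
        if value is not None:  # skip students with an undefined index
            groups.setdefault(label, []).append(value)
    return {label: (mean(v), pstdev(v)) for label, v in groups.items()}

def classify(summary, mean_cut=0.5, sd_cut=0.3):
    """Label each class by the size and spread of its caution indices."""
    labels = {}
    for label, (m, s) in summary.items():
        if m < mean_cut:
            labels[label] = "typical"
        elif s < sd_cut:
            labels[label] = "class-wide pattern"
        else:
            labels[label] = "individual anomaly"
    return labels

# Hypothetical caution indices for three students in each of three classes.
caution = [0.6, 0.7, 0.65, 0.0, 0.1, 1.8, 0.1, 0.2, 0.0]
classes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
labels = classify(summarize_caution(caution, classes))
```

Class A has uniformly elevated indices, while class B reaches a similar mean only because of one student, which is the distinction the standard deviation is meant to capture.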

The class-level analysis of patterns of incorrect item responses
was particularly informative. There were clear instances where students
in the same classroom exhibited a common incorrect problem solving
procedure (e.g., adding both numerator and denominator in the addition of
fractions). The reasons for this incorrect procedure may be traceable
to inadequate instruction or simply lack of instruction when the faulty
procedure was present prior to instruction. Overall, there was considerable
evidence that error patterns reflect both random and systematic processes
and that systematic errors have both individual-specific and
group-specific determinants.
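
A shared faulty procedure of this kind can be detected mechanically by comparing each written answer with the answer the faulty rule would produce. The Python sketch below illustrates the idea for the "add numerators and add denominators" error; the data layout, function names, and responses are hypothetical and do not reproduce the scoring used in the study.

```python
from collections import Counter

def faulty_sum(a, b):
    """Answer produced by adding numerators and denominators separately.

    a, b: fractions as (numerator, denominator) tuples, left unreduced.
    """
    return (a[0] + b[0], a[1] + b[1])

def shared_error_rates(responses, a, b):
    """Share of students in each class whose answer matches the faulty rule.

    responses: list of (class_id, (numerator, denominator)) pairs.
    """
    wrong = faulty_sum(a, b)
    hits, sizes = Counter(), Counter()
    for class_id, answer in responses:
        sizes[class_id] += 1
        if tuple(answer) == wrong:
            hits[class_id] += 1
    return {class_id: hits[class_id] / sizes[class_id] for class_id in sizes}

# Hypothetical answers to 1/4 + 2/4 (correct answer 3/4, faulty answer 3/8).
responses = [
    ("A", (3, 8)), ("A", (3, 8)), ("A", (3, 4)),
    ("B", (3, 4)), ("B", (3, 4)), ("B", (1, 2)),
]
rates = shared_error_rates(responses, (1, 4), (2, 4))
```

A high rate concentrated in one classroom points to a group-specific determinant, whereas isolated matches scattered across classrooms are more consistent with individual-specific errors.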

Concluding Comments

As with any research, the conclusions of this study are limited by
the data employed and further research is needed. Nevertheless, the
present investigation does provide support for arguments that tests can
be constructed in ways which are more or less sensitive to desired
group characteristics (e.g., instructional and program differences)
and that investigations of group-level patterns in test item responses can
provide important information about the group-based differences in
instructional experiences.

Having concluded that the multilevel approaches to test development
and interpretation are potentially beneficial, we need to comment further
on the conditions under which we expect these methods to be maximally
useful. In order to achieve maximum benefits from procedures for selecting
group-sensitive items, it appears that one needs to know the specific
characteristics whose between-group effects one wants to measure. For
instance, it is logical to choose items which exhibit high relationships
to time allocated to instruction if the intended purpose of the scales
constructed from the items is to distinguish the consequences (in future
samples) of differences in instructional coverage. This is precisely the
basis for the item selection procedures employed in the BTES study, and it
might be used in other instances where the intent is to monitor the
effects of such instructional differences. The problem is that in many
cases, investigators neither know nor are able to anticipate the
characteristics of groups that are most salient to their purposes.
Alternatively, the number of characteristics of interest may be large
and their interactions may be complex in natural classroom settings.
Under these circumstances, the investigator is forced to explore a number
of alternatives in the hope of discerning patterns of group sensitivity
that reflect on the questions of interest. This is likely to be both a
time-consuming and difficult task.
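The item selection logic described above can be sketched as follows,
under stated assumptions: each item is summarized by its class-mean
proportion correct, each class has a single allocated-time value, and an
item is retained when the between-class correlation of the two reaches a
cutoff. The function names, the data layout, and the cutoff of 0.5 are
hypothetical and are not taken from the BTES analyses.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_sensitive_items(class_item_means, class_time, threshold=0.5):
    """Keep items whose class means correlate with allocated time.

    class_item_means -- dict: item id -> list of class-mean proportions
                        correct, one per class, ordered as class_time
    class_time       -- list of time allocated to instruction per class
    threshold        -- illustrative cutoff (an assumption, not from the study)
    Returns a dict: item id -> between-class correlation.
    """
    selected = {}
    for item, means in class_item_means.items():
        r = pearson(means, class_time)
        if r >= threshold:
            selected[item] = r
    return selected
```

A scale built from the retained items is sensitive, by construction, to
the variable used to select them, which is why the choice of that
variable has to be made before selection begins.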

We are less concerned that investigation of group-level patterns in
test item performance can go awry. In fact, group-level information 
appears to be particularly well-suited for the purpose of forming
decisions about instruction and program effects. We can envision providing
teachers (and groups of teachers) with the patterns of performance for
their own class as well as patterns for seemingly similar classrooms.
While this class-level information may not be sufficiently diagnostic
about an individual student's problems, it can potentially pinpoint for
teachers (and groups of teachers) the consequences of their particular
decisions about instructional coverage, emphasis, and method. As such,
class and school level patterns of test item performance would seem to
be a valuable element of information-based program improvement activities
in individual classrooms, schools, and school districts.

What remains to be determined about investigations of group-level
item response patterns is whether these methods become intractable once
the number of groups and number of items become large. We also need to
know more about which special characteristics of groups (e.g., heterogeneity
of ability or differential instructional coverage within classrooms) or
items (e.g., the diversity of content, information processing requirements)
cause examinations of response patterns to be more or less fruitful.
There is also a question of how the amount of information and the method
of reporting it affect the usefulness of these procedures for specific
audiences (e.g., teachers, principals, administrators, evaluators).
While the successful results from examinations of graphical procedures are
heartening, there are clearly limits on how far one can go before even
the simplest form of data display becomes an unintelligible blur for the
practitioner.



Given the above concerns, the next phase in this investigation of
multilevel methods for test development and interpretation should be
obvious. It is time to investigate the utility of these multilevel methods
in actual testing and test reporting procedures in schools and school
districts. Studies in such contexts are necessary to identify the boundaries
of the practical applications of a multilevel perspective toward test
usage in local school improvement efforts.

FOOTNOTES

(1) We do not intentionally ignore the role of the home in this
conception. However, school systems have the responsibility of
communicating their educational goals to parents and providing them a means
for participating in the education of their children. Moreover,
schools cannot abdicate their responsibilities in the development
of a well-educated citizenry simply because of shortcomings in the
home.

(2) The scenario need not be restricted to the school district level
and below, especially when broader curriculum and program evaluation
issues are at stake. However, it seems unlikely that the kinds of
program and instructional improvements of interest here can be
reasonably accomplished through examination of higher-level data except
to the extent that a given district judges its performance by
comparison with other districts. The form of signal reflected by
district-level data is almost invariably at least a step removed from the
level where program and instructional changes can be implemented. It is
at the school-building level and below where instructional
management occurs. Thus, we have concentrated our efforts on methods for
using test information at the level of school and classroom. We
return to this issue later on.